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Abstract — DNA sequencing technology has advanced to a 
point where storage is becoming the central bottleneck in the 
acquisition and mining of more data. Large amounts of data 
are vital for genomics research, and generic compression tools, 
while viable, cannot offer the same savings as approaches tuned 
to inherent biological properties. We propose an algorithm to 
compress a target genome given a known reference genome. The 
proposed algorithm first generates a mapping from the reference 
to the target genome, and then compresses this mapping with 
an entropy coder. As an illustration of the performance: 
applying our algorithm to James Watson's genome with hgl8 
as a reference, we are able to reduce the 2991 megabyte (MB) 
genome down to 6.99 MB, while Gzip compresses it to 834.8 MB. 



I. Introduction 

Advances in genomics over the past decade have signif- 
icantly reduced the amount of time and money required to 
sequence individual genomes. What used to take years and 
billions of dollars can now be done in a matter of days 
for only thousands of dollars. As a result, the amount of 
genomics data has expanded to the point where data storage is 
now the bottleneck in the sequencing process. While general 
purpose compression algorithms such as gzip can provide 
good compression, much more can be achieved by taking into 
consideration the inherent properties of DNA. For instance, 
while the human genome consists of roughly 3 billion base 
pairs, it is well documented that any two human genomes 
are more than 99% identical [[T]. At the same time, there 
is an imminent need for efficient compression and storage 
of the increasing amount of genomic data. The similarity in 
genomic data suggests compression schemes that exploit this 
redundancy. Starting from this premise, we investigate the 
problem of lossless compression of a target genome when a 
reference genome is available to both the encoder and decoder 
without the aid of any additional external information. 

Lossless genome compression with a reference sequence 
can be viewed as a two stage problem. In the first stage, a 
mapping from the reference genome to the target genome 
is generated in an efficient manner The second stage then 
involves describing this mapping concisely to the decoder 
Given the mapping and the reference genome, the decoder 
is thus able to recover the target genome sequence. 

The process of constructing a mapping from the reference 
genome to the target genome in a concise manner can be 
quite challenging. In [2] this problem is approached from a 
biological standpoint. Given a reference and target genome, 
two separate files are produced: one consisting of single 



nucleotide polymorphisms (SNPs), and one consisting of inser- 
tions and deletions of multiple nucleotides (indels). Thus, the 
mapping between genomes is expressed as a combination of 
insertion, deletion and substitution operations. Working with 
these two files and an additional database of common SNPs 
for the human genome, ||3| presents an algorithm to losslessly 
compress the SNP and indel files. 

In practice, the SNPs and indels files may not be available 
and generating them is an overhead, especially when it does 
not necessarily determine the smallest set of differences be- 
tween the genomes. This drawback motivates us to develop 
an end-to-end compression scheme that seeks to generate an 
efficient mapping between a reference and a target genome, 
and then describe this mapping in an efficient manner 

This problem and its variations have been studied in ||4]- 
In H Pinho et al. present GReEn, a compression tool 
based on arithmetic coding that overcomes some drawbacks 
of the previously proposed tool GRS 15]. In (|6l Kuruppu et 
al. introduce RLZ, an algorithm that uses a greedy technique 
based on LZ77 ifTOl to parse the target genome into factors, 
each of which occurs in the reference sequence. An optimized 
version of this algorithm is proposed in ff]. In [|8] Heath 
et al. focus not only on compression, but also on efficient 
manipulation, studying the tradeoff between the two. In ||9l 
Ma et al. propose an algorithm to compress a source sequence 
(target genome) given side information (reference genome) 
and show that it is asymptotically optimal for small insertion 
and deletion probabilities, assuming i.i.d. deletions. 

Typically, the transitions between a reference and a target 
genome can be described by a combination of insertion, 
deletion and substitution operations. Due to the nature of this 
corruption mechanism, the assumption of the data sequences 
being jointly stationary is not valicQ and, as a result, existing 
schemes for compression in the presence of side information, 
which are designed under that assumption, are not applicable. 
For example, the LZ77 algorithm |[TOl and the Context-Tree 
Weighting (CTW) method ifTTI . along with their extensions to 
side information [121 and fT3|, respectively, are well studied 
compression techniques. The optimality of lfT2l and |[T3l is 
under the jointly stationary model. Further, DNA data has 
a much higher proportion of substitutions, than insertions or 
deletions. For instance, in the SNP and indel files presented 

' To see why joint stationarity is not a valid assumption, it suffices to note 
tliat even for tlie simple scenaiio wliere the reference sequence is stationary 
and the target sequence is its corrupted version under i.i.d. deletions, the 
process pair need not be jointly stationary. 



in [si, it can be seen that substitutions constitute more than 
90% of the operations. 

Motivated by the LZ77 algorithm, and under the premise of 
a large fraction of substitutions, the first stage of our proposed 
algorithm parses the target genome using the reference as side 
information. Subsequently, the proposed algorithm segments 
the differences into substitutions, and certain types of insertion 
and deletion operations - a categorization which assists the 
second stage of the algorithm where these differences are 
compressed. 

The rest of this paper is organized as follows: In Section HH 
we formally describe the problem. In Section |IIT] we describe 
the proposed algorithm, and in Section |IV] we discuss its 
performance. Section |V] concludes the paper. 

II. Problem Description 

We begin by introducing the notation. Let upper case, lower 
case, and calligraphic letters denote, respectively, random 
variables, specific or deterministic values they may assume, 
and their alphabets. is shorthand for the n — m + 1 tuple 
{Xjn, Xjn+i, ■ ■ ■ , Xn-i, Xn}- X"^ also dcnotcs X". When 
i < {), X^ denotes the null string as it is also for Xl, when 
i > j. X"\' denotes {Xi, . . . , . . . , X„}. The 

cardinality of an alphabet X is denoted by \X\. 

We consider the problem of compressing the target genome 
X^ given the reference genome as side information, 
which is available at both the encoder and decoder We haveJl 
X = y. The encoder is described by the mapping f{X^,Y^') 
which indicates the compressed version of X^ given as 
side information. The decoder is described by the mapping 
g{f{X^, Y^^), F*^). Since our framework is that of lossless 
compression, the encoder-decoder pair satisfies 

The objective is to construct an encoding/decoding scheme 
that minimizes the length of the representation ./(•,•), while 
ensuring a perfect reconstruction of the sequence X^ at the 
decoder 

III. Proposed Algorithm 

The encoding algorithm can be divided into two stages. First 
we generate a mapping from the reference genome F*^ to 
the target genome X^ . In the second stage, this mapping is 
compressed losslessly using an entropy coder The decoder 
then takes the compressed representation of this mapping and 
decompresses it, obtaining the mapping from to X^ . 
Since is available at the decoder as side information, it 
can recover X^ perfectly by inverting the original mapping. 

We now describe the two stages performed by the algorithm. 

^Typically, genomic data comprises of the characters {A, C, T, G, N}, 
though it may vary from case to case. 



A. Generating the mapping 

The mapping generation is motivated by the sliding win- 
dow Lempel-Ziv algorithm LZ77 ifTOl . In LZll, new data 
is expressed in terms of previously parsed data. In order to 
exploit the similarities between the reference and the target 
sequences, we express strings in the target sequence in terms of 
a corresponding matching string in the reference sequence. As 
motivated in Section HI we apply techniques which exploit the 
appearance of a larger fraction of substitutions. This, coupled 
with the fact that the reference and target genomes are not 
jointly stationary, allows us to get a significant improvement 
over f6|, where an LZ77 based parsing scheme is also con- 
sidered. We begin by describing the dynamic window based 
string matching algorithm. This is followed by a segmentation 
of the results obtained, into substitutions, and certain forms of 
insertions and deletions, which facilitates the compression of 
the data. 

Define J/JJ^^^f^^ as the window at time k. Later we will 
discuss how to choose the parameters {Wk,Lk, Rk)- The basic 
idea of the algorithm can be summarized as follows: Assume 
.t"'' has already been encoded, for some n^. Then, find the 
largest length h s.t. x^^ll[^ = Ui^^"^^ and Xn^+i^+i + y^+l^, 
for some Wk ~ Lk < i < Wk + Rk, and encode i, Ik and the 
novel symbol Xn^+i^+i- Now, set Uk+i = Uk + h + ^ and 
repeat the procedure. Thus, starting from a position in the 
target sequence, we find the longest matching string within a 
fixed window in the reference sequence. We then encode the 
position and length of the match in the reference as well as 
the first unmatched novel symbol. At the end of this part of 
the algorithm we have a set of instructions {F} that suffices 
to reconstruct the target genome based on the reference. 

Ideally, we would like to allow the encoder to search for a 
match in the entire reference sequence in order to find the 
best match. However, for computational tractability we restrict 
our algorithm to a finite window search, while dynamically 
updating its position and width along the reference so as to 
efficiently capture the synchronization in traversing along the 
two sequences. This process is described in detail below. 

The window at time k is described by the parameters 
{Wk,Lk, Rk)- Intuitively, the center of the window Wk is the 
rough estimate of where we expect to find a match in the refer- 
ence sequence, while accounting for shifts in synchronization 
due to insertions and deletions. We choose {Wn^ , , Ruk ) as 
a function of the previous (W„. , , i?„.), U and pi (where 
i < k). Note that we are not defining window parameters 
for all n, but only for specific time instant nk recursively. In 
practice, we use this idea to find optimal Wn^^ and use fixed 
{Lk^Rk). 

The main idea is the following: First, store the previous 
M, li and pi, and increase Wn^ by Ik- Then, if pi increases 
(or decreases) too much, shift Wn^ so as to keep pi near the 
window center to keep the window size small. More precisely, 
check the previous AI matches and shift Wn^ by the median 
of these pi. Here we take the median of pi since we want to 
be conservative and avoid excessive shifts, and to allow the 



algorithm to be agnostic to bursty insertions and deletions. In 
order to keep this algorithm practical, we do this correction 
every AI matches. This is a trade-off between a reasonable 
running time and improved window synchronization. Note 
that this idea is equivalent to estimating the deletion and 
insertion rates at each update-time based on a certain number 
of previous matches. 

Define the set S{n,W, L, R, x°° ,y°°) = {/ : = 
yj+'^^and x^+i+i 2/i+;,for some W - L < i < W + R}. 
Then, the procedure can be formally described as follows: 

Algorithm to generate the mapping 
Initialize: Set ni = and fc = 1 
Do: 

1. Compute Sk = S{nk,Wk, Lk, Rk, ,y°°) 

2. Denote the k-th phrase by j^j''^^, where = maxS'fc 

3. Set pk = argmiuj 5*^ and Zk = Xn^^+i^^+i 

4. Store Fk = {pk,lk,Zk) 

5. Set Uk+i = rife + /fc + 1, k = k + 1 and update the 
window parameters {Wk, Lk^Rk) 

6. Repeat the process until the sequence is exhausted 



Example 1 Consider the following example of a target 
genome and a reference genome. 

Target : AATG C AGGTAC T ATAAG N AA N TGC ■ ■ ■ 



Reference : AATG T AGGTACATAAG A TGCNNNN ■ ■ ■ 

In this case, the set of instructions {F} will be as follows, 

Fi:(pi,Zi,zi)=(l,4,C) 
F2 ■■ {P2M.Z2) =(6,6,r) 
^3 : (P3J3,^3) -(12,5,iV) 
F4 : (^4,^4,^4) =(14,2,iV) 



Notice that the set of instructions {F} given the reference 
genome as side information suffices to perfectly reconstruct 
the target genome. We perform a further step to explicitly 
identify insertions, deletions and substitutions that represent 
the edits from the reference to the target genome. Working 
under the premise that most of the differences between two 
genomes are given by substitutions, this step is designed 
towards compression of such genomes by reducing the number 
of elements to store. 

In other words, the aim is to reduce the size of {F} by 
merging some of the instructions and creating new ones in the 
form of substitutions, deletions and insertions, that are stored 
in the sets {5}, {/} and {D}, respectively. We seek only to 
obtain insertions of length one, so that for the substitutions 
and insertions we only need to store the position at which 
they occur in the target genome and the new symbol. For the 
deletions, we need to store the position as in the previous 
cases, and its length, which can not exceed a predetermined 



maximum value Lmax- Next, we explain how to classify the 
edits into these categories. 

In order to identify the substitutions we check if the 
following condition between Fk and Fk+i is satisfied: 



Pfc + ^A; + 1 = Pk+1- 



If ([T]i holds, we have 



y', 



Pk 



and 



(1) 



(2) 



(3) 



which represents a substitution at from yp^+i-i to Zk- 

Then, we substitute instructions Fk and Fk+i by a unique 
instruction given by {pk, Ik + ^fc+i + 1, Zk+i), and we add the 
new substitution to the set {S} as 

where p^^^ and z^"^ indicate the absolute position in the 
target genome and the new character, respectively. Insertions 
of length one occur if the following condition holds: 



Pk + lk= Pk+i- 
If this is the case, we have 



•^"fc »Pk ' 



(4) 



(5) 



and an insertion of symbol Zk at position Xnk+i - As before, if 
this condition holds, we substitute the two original instructions 

by {pk, Ik + h+i, Zk+i), and add the new insertion 

(pW,zW) = K+i-l,Zfe) 

to the set {/}. 

Finally, the deletions are found by checking if 



and 



2 < Pk+l - {Pk +h + l) < Lmaxj 



Zk — Dpk+i- 



(6) 



(7) 



are satisfied, meaning that we can reconstruct x^n't+i'^''^^ 
from 2/^'+'+'"+'"^ by deleting J/p'+i^^ Therefore, the two 
instructions become {pk,Pk+i + ^fc+i —Pk, Zk+i), and we add 
the deletion 

(p^''), = {nk + lk,Pk+i-l-Pk-h) 

to the set {D}, where p'^'^^ and l^'^^ represent the position in 
the target genome at which the deletion occurs and its length, 
respectively. 

The method described above is applied to any number of 
consecutive instructions that satisfy a specific condition. It is 
straightforward to see how the new instructions are generated 
in that case, and we omit the description here. 

Considering the Example \T\ again, we can update the in- 
structions as follows. 



Example 1, continued 

Target : AATG C AGGTAC T ATAAG N AA N TGC ■ ■ ■ 
Reference : AATG T AGGTACATAAG A TGCNNNN ■ ■ ■ 

In this case, the new set of instructions {i^}, {S*}, {D}, {/} 
is as follows, 

(1,16,7V) 

/i:(p«,z«)=:(ll,r) 
i^2:(P2,Z2,^2) = (14,2,7V) 



The decoder, with the sets {F}, {S}, {/} and {D} and 
the help of j/*^ can recover x^. 

B. Entropy coder 

In this subsection we describe the entropy coder that we 
employ to compress and efficiently describe the sets {F}, {S}, 
{/} and { 13}. Specifically, we need to store all the characters 
and integers that appear in those sets, each of which we treat 
separately. 

Recall that each instruction in the set {F} has two integers, 
to represent the position and the length, respectively. For the 
integers representing the position we perform delta encoding, 
i.e., for each position we encode the difference between that 
position and the previous one. In addition, since they may not 
appear in increasing order, we keep one bit for each integer to 
specify the sign. For the lengths we do not perform any delta 
encoding. On the other hand, the integers in the sets {5, /} 
appear in increasing order and thus we perform delta encoding 
without retaining the sign bit. Finally, for each entry of the 
set {13} there are two integers, which respectively describe 
the position and the length of each deletion. Note that the 
first fist of integers is ordered and therefore we perform delta 
encoding. We then add the deletion lengths to the resulting 
Ust. Using this technique, we have now created one list of all 
the integers that we need to describe to the decoder Finally, 
we compress this list using Huffman encoding lfT4l . 

In order to address the issue of a large codebook in the 
associated Huffman code, we employ the following measures 
which we found to work well on actual genomic data, as 
demonstrated below. First, it is likely that the codebook will 
have a sequence of consecutive integers 1,2,...,A^, where 

can be quite large. In such a case, we only need to store 
the codewords of these consecutive numbers without sending 
any bits to describe the actual symbols (integers), except the 
number N . The remaining integers bigger than N are sorted 
in increasing order Then, we perform delta encoding on them 
to reduce the number of integers that we need to actually 
describe. Finally, we perform Golomb encoding [15] because 
the occurrence of small integer values in the codebook is 
significantly more likely than large values, especially after 
performing delta encoding. 
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hgl8 


JW 


2,991 


834.8 


18.23 


6.99 


hgl8 


YH 


2,987 


832.0 


9.85 


8.50 


K0224 


YH 


2,987 


832.0 


10.06 


9.54 


K0131 


YH 


2,987 


832.0 


10.27 


9.78 


K0131 


K0224 


2,938 


831.4 


1.31 


1.53 


K0131 


hgl8 


2,996 


836.2 


7.82 


9.19 


K0131 


JW 


2,991 


834.8 


23.23 


10.22 


K0224 


JW 


2,991 


834.8 


22.97 


10.37 


TAIR8 


TAIR9 


113.63 


34.1 


0.0063 


0.0034 


TIGR5.0 


TIGR6.0 


355.07 


108.67 


0.12 


29.25 



TABLE I 

Compression results in MB of the target genome for Gzip, 
GReEn and the proposed algorithm. Best result is shown in 

BOLD. 



As for the characters, we use fixed Huffman tables. Usually 
only the characters A, C, T, G, N appear in the sequences, and 
if any other character appears, it does with very low frequency. 
Therefore, we use shorter bit strings for these more common 
characters. 

IV. Compression Performance 

To experiment with the performance of our algorithm, we 
employed it on real genomic data. Specifically, we use the 
hgl8 release from the UCSC Genome Browser, the korean 
genomes KOREF20090131 and KOREF20090224 |[T6l, the 
genome of a Han Chinese referred to as YH ifTTl and the 
Watson JW genome JSJ. The latter is generated from the edits 
provided in |j3l using hgl8 as the reference genome. We also 
apply our algorithm to two versions of the genome of the thale 
cress Arabidopsis thaUana, TAIR8 and TAIR9 ([18], and of the 
genome of the rice Oryza sativa TIGR5.0 and TIGR6.0 |fT9l . 

For ease of notation, hereafter we will refer to the 
genomes KOREF20090131 and KOREF20090224 by K0131 
and K0224, respectively. 

Some of the genomes present the characters in upper and 
lower case. Notice that our algorithm treats both of them as 
the same character, i.e., 'a' and 'A' are considered equivalent. 
For fair comparison with other algorithms, we first convert all 
the above mentioned genomes to upper case. 

In H, the authors compare the performance of their pro- 
posed algorithm GReEn with RLZ |j6) and GRS 13], demon- 
strating that it outperforms the latter two in most cases. 
Furthermore, RLZ only handles the characters {A, C, T, G, 
N} and GRS is unable to compress sequences that differ 
significantly from the reference. For these reasons, we restrict 
the comparison of the proposed algorithm to that of GReEn. 

Table J] summarizes the results of running our algorithm and 
GReEn for different choices of reference and target genomes 
among those described above. We set the parameters M and 
Lmax to 100 and 1000, respectively. We also include the 
results obtained by the general compression tool Gzip to 
show the gains obtained by compression algorithms designed 
specifically for the problem under consideration. All results 
were obtained using an Intel Core i7 CPU @ 2.80 GHz with 
3087 MB RAM and Ubuntu 10.04 LTS. 



As can be seen, our algorithm outperforms GReEn in 
seven out of the ten cases where we tested both algorithms. 
Specifically, it is worth noticing the compression result of JW 
given hgl8 as the reference, in which our scheme outperforms 
GReEn by more than 60%. As a comparison, in [!3|, the authors 
compress the edits of the same data set and they manage to 
decrease the size of the compressed file down to 7.51 MB. 
Note that in our case, we do not rely on that information 
and we use only the initial sequence of genomes. Still, the 
proposed approach manages to reduce the target genome down 
to 6.98 MB. Furthermore, our algorithm performs especially 
well with the data TAIR; it reduces the size of the compressed 
files by 50% more than GReEn. 

On the other hand, notice that if the synchronization be- 
tween the target and the reference genome is lost due to 
a significant proportion of bursty insertions and deletions, 
the number of generated instructions is significantly greater, 
leading to a large compressed file. This is the case for the data 
TIGR. 

Our study thus far has focused on the compression per- 
formance of our proposed scheme. Another criterion of ob- 
vious importance is the complexity. Current running times 
of our scheme are comparable though usually longer than 
those of GReEn. However, our current implementation is far 
from optimized towards reduction of the running time. Our 
work in progress is dedicated to realizing what should be 
significant potential for improvement of the running time due, 
among other factors, to the highly parallelizable nature of the 
algorithm. 

V. Conclusion - Future Work 

In this paper we investigate the problem of compressing 
a target genome sequence given a reference genome that 
is known at both the encoder and decoder We exploit the 
redundancy between two genomes to effectively compress the 
differences between the target and reference sequences. We 
present an efficient compression scheme which does not rely 
on any other external information. The proposed algorithm, 
which is motivated by the sliding window Lempel-Ziv algo- 
rithm, first generates a mapping from the reference to the target 
genome, and then compresses this mapping. The algorithm is 
end-to-end, and has the additional benefit of identifying SNP's 
(substitutions) which are of significant biological interest. As 
an illustration of our results, we compress the Watson genome 
from 2,991 MB to 6.99 MB using the hgl8 release as the 
reference. 

We envisage our algorithm as having an impact on personal 
genomics and medicine, where storage and accessibility of 
patients' genome is a concern given the large sizes of human 
genomic data. 

Directions for future work include optimizing the choice of 
window parameters, to avoid losing synchronization between 
the target and the reference genome when the rate of the 
deletions and insertions is high. Other considerations include 



reducing running time, along with developing a fast-access 
technique to allow partial decompression of the target genome 
in local regions of interest. From a theoretical viewpoint it 
will be interesting to understand whether and under what 
assumptions our algorithm approaches the fundamental limit 
on compression. 
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